Goto

Collaborating Authors

 visual question





Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Neural Information Processing Systems

Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLM for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar.


Understanding Information Storage and Transfer in Multi-Modal Large Language Models

Neural Information Processing Systems

Understanding the mechanisms of information storage and transfer in Transformer-based models is important for driving model understanding progress. Recent work has studied these mechanisms for Large Language Models (LLMs), revealing insights on how information is stored in a model's parameters and how information flows to and from these parameters in response to specific prompts. However, these studies have not yet been extended to Multi-modal Large Language Models (MLLMs). Given their expanding capabilities and real-world use, we start by studying one aspect of these models -- how MLLMs process information in a factual visual question answering task. We use a constraint-based formulation which views a visual question as having a set of visual or textual constraints that the model's generated answer must satisfy to be correct (e.g.


AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Neural Information Processing Systems

In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs via tree search, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as What event is commemorated by the building depicted in this image?, is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users.


ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment

Karimi, Ehsan, Rahnemoonfar, Maryam

arXiv.org Artificial Intelligence

Natural disasters usually affect vast areas and devastate infrastructures. Performing a timely and efficient response is crucial to minimize the impact on affected communities, and data-driven approaches are the best choice. Visual question answering (VQA) models help management teams to achieve in-depth understanding of damages. However, recently published models do not possess the ability to answer open-ended questions and only select the best answer among a predefined list of answers. If we want to ask questions with new additional possible answers that do not exist in the predefined list, the model needs to be fin-tuned/retrained on a new collected and annotated dataset, which is a time-consuming procedure. In recent years, large-scale Vision-Language Models (VLMs) have earned significant attention. These models are trained on extensive datasets and demonstrate strong performance on both unimodal and multimodal vision/language downstream tasks, often without the need for fine-tuning. In this paper, we propose a VLM-based zero-shot VQA (ZeShot-VQA) method, and investigate the performance of on post-disaster FloodNet dataset. Since the proposed method takes advantage of zero-shot learning, it can be applied on new datasets without fine-tuning. In addition, ZeShot-VQA is able to process and generate answers that has been not seen during the training procedure, which demonstrates its flexibility.


Saliency Guided Longitudinal Medical Visual Question Answering

Wu, Jialin, Liu, Xiaofeng

arXiv.org Artificial Intelligence

Longitudinal medical visual question answering (Diff-VQA) requires comparing paired studies from different time points and answering questions about clinically meaningful changes. In this setting, the difference signal and the consistency of visual focus across time are more informative than absolute single-image findings. We propose a saliency-guided encoder-decoder for chest X-ray Diff-VQA that turns post-hoc saliency into actionable supervision. The model first performs a lightweight near-identity affine pre-alignment to reduce nuisance motion between visits. It then executes a within-epoch two-step loop: step 1 extracts a medically relevant keyword from the answer and generates keyword-conditioned Grad-CAM on both images to obtain disease-focused saliency; step 2 applies the shared saliency mask to both time points and generates the final answer. This closes the language-vision loop so that the terms that matter also guide where the model looks, enforcing spatially consistent attention on corresponding anatomy. On Medical-Diff-VQA, the approach attains competitive performance on BLEU, ROUGE-L, CIDEr, and METEOR while providing intrinsic interpretability. Notably, the backbone and decoder are general-domain pretrained without radiology-specific pretraining, highlighting practicality and transferability. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA.


WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

Chang, Eun, Huang, Zhuangqun, Liao, Yiwei, Bhavsar, Sagar Ravi, Param, Amogh, Stark, Tammy, Ahmadyan, Adel, Yang, Xiao, Wang, Jiaqi, Abdullah, Ahsan, Nguyen, Giang, Iyer, Akil, Hall, David, Li, Elissa, Moon, Shane, Scheffer, Nicolas, Ahmed, Kirmani, Damavandi, Babak, Wanga, Rakesh, Kumar, Anuj, Patel, Rohit, Dong, Xin Luna

arXiv.org Artificial Intelligence

We introduce WearVQA, the first benchmark specifically designed to evaluate the Visual Question Answering (VQA) capabilities of multi-modal AI assistant on wearable devices like smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction--where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,520 carefully curated image-question-answer triplets, spanning 7 diverse image domains including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common senses. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieved a QA accuracy as low as 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks.


Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Karimi, Ehsan, Le, Nhut, Rahnemoonfar, Maryam

arXiv.org Artificial Intelligence

--Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial V ehicles (UA Vs), providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. T o address these limitations, we propose Think First, Assign Next (ThiF AN-VQA), a two-stage reasoning-based framework for Visual Question Answering (VQA) in disaster scenarios. ThiF AN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improve the model performance. Experiments on FloodNet and RescueNet-VQA, UA V-based datasets from flood-and hurricane-affected regions, demonstrate that ThiF AN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks. N the immediate aftermath of natural disasters, first responders rely heavily on up-to-date information to assess damage, identify hazards, allocate resources, and reach survivors as quickly as possible.